40 research outputs found
Prismer: A Vision-Language Model with An Ensemble of Experts
Recent vision-language models have shown impressive multi-modal generation
capabilities. However, they typically require training huge models on massive
datasets. As a more scalable alternative, we introduce Prismer, a data- and
parameter-efficient vision-language model that leverages an ensemble of domain
experts. Prismer only requires training of a small number of components, with
the majority of network weights inherited from readily-available, pre-trained
domain experts, and kept frozen during training. By leveraging experts from a
wide range of domains, we show that Prismer can efficiently pool this expert
knowledge and adapt it to various vision-language reasoning tasks. In our
experiments, we show that Prismer achieves fine-tuned and few-shot learning
performance which is competitive with current state-of-the-art models, whilst
requiring up to two orders of magnitude less training data. Code is available
at https://github.com/NVlabs/prismer.
Comment: Tech Report. Project Page: https://shikun.io/projects/prismer. Code:
https://github.com/NVlabs/prismer. v2: fixed incorrect training cost estimate
and zero-shot NoCaps performance of SimVL
A Comparison between Deep Neural Nets and Kernel Acoustic Models for Speech Recognition
We study large-scale kernel methods for acoustic modeling and compare to DNNs
on performance metrics related to both acoustic modeling and recognition.
Measuring perplexity and frame-level classification accuracy, kernel-based
acoustic models are as effective as their DNN counterparts. However, on
token error rates, DNN models can be significantly better. We have discovered
that this might be attributed to DNNs' unique strength in reducing both the
perplexity and the entropy of the predicted posterior probabilities. Motivated
by our findings, we propose a new technique, entropy regularized perplexity,
for model selection. This technique can noticeably improve the recognition
performance of both types of models, and reduces the gap between them. While
effective on Broadcast News, this technique could also be applicable to other
tasks.
Comment: arXiv admin note: text overlap with arXiv:1411.400
MimicGen: A Data Generation System for Scalable Robot Learning using Human Demonstrations
Imitation learning from a large set of human demonstrations has proved to be
an effective paradigm for building capable robot agents. However, the
demonstrations can be extremely costly and time-consuming to collect. We
introduce MimicGen, a system for automatically synthesizing large-scale, rich
datasets from only a small number of human demonstrations by adapting them to
new contexts. We use MimicGen to generate over 50K demonstrations across 18
tasks with diverse scene configurations, object instances, and robot arms from
just ~200 human demonstrations. We show that robot agents can be effectively
trained on this generated dataset by imitation learning to achieve strong
performance in long-horizon and high-precision tasks, such as multi-part
assembly and coffee preparation, across broad initial state distributions. We
further demonstrate that the effectiveness and utility of MimicGen data compare
favorably to collecting additional human demonstrations, making it a powerful
and economical approach towards scaling up robot learning. Datasets, simulation
environments, videos, and more at https://mimicgen.github.io.
Comment: Conference on Robot Learning (CoRL) 202
Voyager: An Open-Ended Embodied Agent with Large Language Models
We introduce Voyager, the first LLM-powered embodied lifelong learning agent
in Minecraft that continuously explores the world, acquires diverse skills, and
makes novel discoveries without human intervention. Voyager consists of three
key components: 1) an automatic curriculum that maximizes exploration, 2) an
ever-growing skill library of executable code for storing and retrieving
complex behaviors, and 3) a new iterative prompting mechanism that incorporates
environment feedback, execution errors, and self-verification for program
improvement. Voyager interacts with GPT-4 via blackbox queries, which bypasses
the need for model parameter fine-tuning. The skills developed by Voyager are
temporally extended, interpretable, and compositional, which compounds the
agent's abilities rapidly and alleviates catastrophic forgetting. Empirically,
Voyager shows strong in-context lifelong learning capability and exhibits
exceptional proficiency in playing Minecraft. It obtains 3.3x more unique
items, travels 2.3x longer distances, and unlocks key tech tree milestones up
to 15.3x faster than prior SOTA. Voyager is able to utilize the learned skill
library in a new Minecraft world to solve novel tasks from scratch, while other
techniques struggle to generalize. We open-source our full codebase and prompts
at https://voyager.minedojo.org/.
Comment: Project website and open-source codebase:
https://voyager.minedojo.org
MimicPlay: Long-Horizon Imitation Learning by Watching Human Play
Imitation learning from human demonstrations is a promising paradigm for
teaching robots manipulation skills in the real world. However, learning
complex long-horizon tasks often requires an unattainable number of
demonstrations. To reduce the high data requirement, we resort to human play
data - video sequences of people freely interacting with the environment using
their hands. Even with different morphologies, we hypothesize that human play
data contain rich and salient information about physical interactions that can
readily facilitate robot policy learning. Motivated by this, we introduce a
hierarchical learning framework named MimicPlay that learns latent plans from
human play data to guide low-level visuomotor control trained on a small number
of teleoperated demonstrations. With systematic evaluations of 14 long-horizon
manipulation tasks in the real world, we show that MimicPlay outperforms
state-of-the-art imitation learning methods in task success rate,
generalization ability, and robustness to disturbances. Code and videos are
available at https://mimic-play.github.io.
Comment: 7th Conference on Robot Learning (CoRL 2023 oral presentation)
Eureka: Human-Level Reward Design via Coding Large Language Models
Large Language Models (LLMs) have excelled as high-level semantic planners
for sequential decision-making tasks. However, harnessing them to learn complex
low-level manipulation tasks, such as dexterous pen spinning, remains an open
problem. We bridge this fundamental gap and present Eureka, a human-level
reward design algorithm powered by LLMs. Eureka exploits the remarkable
zero-shot generation, code-writing, and in-context improvement capabilities of
state-of-the-art LLMs, such as GPT-4, to perform evolutionary optimization over
reward code. The resulting rewards can then be used to acquire complex skills
via reinforcement learning. Without any task-specific prompting or pre-defined
reward templates, Eureka generates reward functions that outperform expert
human-engineered rewards. In a diverse suite of 29 open-source RL environments
that include 10 distinct robot morphologies, Eureka outperforms human experts
on 83% of the tasks, leading to an average normalized improvement of 52%. The
generality of Eureka also enables a new gradient-free in-context learning
approach to reinforcement learning from human feedback (RLHF), readily
incorporating human inputs to improve the quality and the safety of the
generated rewards without model updating. Finally, using Eureka rewards in a
curriculum learning setting, we demonstrate, for the first time, a simulated
Shadow Hand capable of performing pen spinning tricks, adeptly manipulating a
pen in circles at rapid speed.
Comment: Project website and open-source code:
https://eureka-research.github.io
VIMA: General Robot Manipulation with Multimodal Prompts
Prompt-based learning has emerged as a successful paradigm in natural
language processing, where a single general-purpose language model can be
instructed to perform any task specified by input prompts. Yet task
specification in robotics comes in various forms, such as imitating one-shot
demonstrations, following language instructions, and reaching visual goals.
They are often considered different tasks and tackled by specialized models. We
show that a wide spectrum of robot manipulation tasks can be expressed with
multimodal prompts, interleaving textual and visual tokens. Accordingly, we
develop a new simulation benchmark that consists of thousands of
procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert
trajectories for imitation learning, and a four-level evaluation protocol for
systematic generalization. We design a transformer-based robot agent, VIMA,
that processes these prompts and outputs motor actions autoregressively. VIMA
features a recipe that achieves strong model scalability and data efficiency.
It outperforms alternative designs in task success rate in the hardest
zero-shot generalization setting, given the same training data.
With less training data, VIMA still performs better than
the best competing variant. Code and video demos are available at
https://vimalabs.github.io/.
Comment: ICML 2023 Camera-ready version. Project website:
https://vimalabs.github.io